Quick Presets:
Highest Precision
High Risk + File Context (Precision)
Best for Complex Code (Precision)
Python + Medium Risk (Precision)
Best for Concurrency (Precision)
TypeScript + Scheduling (Recall)
Best for Performance Optimization
Best for Bug Fixes (Recall)
Best for Small Go PRs
Java + Authentication
Small PRs + Performance Optimization (Precision)
Best for Medium Ruby PRs
Best for Bug Fixes
TypeScript + Correctness
Ruby + Medium PRs (Recall)
Best for UI
Bug Fixes + Cross-File
Best for Reliability
Best for Concurrency
Ruby + Correctness
Best for Caching
Best for Go
Best for Small PRs
Best for Scheduling
Best for File Context
Best for Security
Security Critical
Best for High Risk
Best for Python
Best for Medium Python PRs
Best for Moderate Bugs
Highest Recall
Best for Authentication
High Risk Auth
Best for Critical Risk
Best for Medium Java PRs
Best for Features
Best for Moderate Code
Best for Complex Code
Complex & Subtle
Best for Correctness
Best for Java
Best for Large PRs
Highest F1
Best for TypeScript
Best for Subtle Bugs
Best for Cross-File
Best for Medium PRs
Best for Medium Risk
Best for Ruby
CURRENT RESULTS
All Languages
Performance Metrics
| # | Tool | Precision (%) | Recall (%) | F1 Score (%) | True Positives | PRs Evaluated |
|---|------|---------------|------------|--------------|----------------|---------------|
F1 Score by Tool
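As a reminder of how the ranking metric is derived: F1 is the harmonic mean of precision and recall, so a tool only scores well when both are high. A minimal sketch (the example numbers are illustrative, not taken from the benchmark):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both as percentages)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical tool with 80% precision and 60% recall:
print(round(f1_score(80.0, 60.0), 1))  # 68.6
```

Note how the harmonic mean penalizes imbalance: a tool at 80/60 scores 68.6, while one at 95/5 scores only 9.5, even though their arithmetic averages differ far less.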

Repositories Used

The offline benchmark draws from a diverse set of open-source repositories spanning different languages, frameworks, and domains — from infrastructure and observability tools to web platforms and security projects.

This variety ensures our results reflect how AI reviewers perform across real-world codebases, not just one type of software.